Synthetic CT generation based on CBCT using improved vision transformer CycleGAN

Cone-beam computed tomography (CBCT) is a crucial component of adaptive radiation therapy; however, it frequently suffers from artifacts and noise, which significantly constrain its clinical utility. While CycleGAN is a widely employed method for CT image synthesis, it captures global features inadequately. To tackle these challenges, we introduce a refined unsupervised learning model called improved vision transformer CycleGAN (IViT-CycleGAN). Firstly, we integrate a U-net framework that builds upon the vision transformer (ViT). Next, we augment the feed-forward neural network by incorporating deep convolutional networks. Lastly, we enhance the stability of the model training process by introducing a gradient penalty and integrating an additional loss term into the generator loss. Experiments demonstrate from multiple perspectives that the synthetic CT (sCT) generated by our model has significant advantages over that of other unsupervised learning models, thereby validating the clinical applicability and robustness of our model. In future clinical practice, our model has the potential to assist clinical practitioners in formulating precise radiotherapy plans.

where $\ell_{GAN}$ is the classification loss function, and 0 and 1 are the class labels of the generated and real images, respectively 36.
The generators are updated by backpropagating loss from three sources: GAN loss, cycle-consistency loss, and identity-consistency loss 36. Using $G_{A \to B}$ as an example: (1)
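As a sketch of how these three sources are typically combined in a CycleGAN-style objective, the generator loss for $G_{A \to B}$ can be written as below. The weights $\lambda_{cyc}$ and $\lambda_{idt}$ and the subscript naming are assumptions made for illustration; they are not values or notation stated in this text.

```latex
\begin{aligned}
\mathcal{L}_{GAN} &= \mathbb{E}_{x \sim A}\, \ell_{GAN}\big(D_B(G_{A \to B}(x)),\, 1\big) \\
\mathcal{L}_{cyc} &= \mathbb{E}_{x \sim A}\, \lVert G_{B \to A}(G_{A \to B}(x)) - x \rVert_1 \\
\mathcal{L}_{idt} &= \mathbb{E}_{y \sim B}\, \lVert G_{A \to B}(y) - y \rVert_1 \\
\mathcal{L}_{gen,\,A \to B} &= \mathcal{L}_{GAN} + \lambda_{cyc}\, \mathcal{L}_{cyc} + \lambda_{idt}\, \mathcal{L}_{idt}
\end{aligned}
```

The GAN term uses label 1 because the generator is rewarded when the discriminator classifies its output as real.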

IViT-CycleGAN architecture
The original generator of CycleGAN can only retain and convey local feature information and lacks the ability to capture global features, which results in subpar image quality and authenticity. To address this limitation, this research incorporates a ViT-based U-net framework into the generator, as depicted in Fig. 2.
Firstly, the U-net architecture is employed to extract and retain crucial tissue features and detailed information, and its skip connections effectively resolve the issue of information loss. Subsequently, the self-attention mechanism of the transformer is employed to automatically prioritize information from various positions within the image during image generation, enhancing the comprehension of the global structure within tissue images. Lastly, a deep convolutional network is introduced into the feedforward neural network to concentrate on regions with more intricate details, resulting in clearer and more realistic generated images. Specifically, the encoding path of U-net extracts features from the input through four layers of convolution and downsampling, and passes the extracted features from each layer to the corresponding layer of the decoding path through skip connections. In the encoding path of U-net, the preprocessing layer converts the image into a tensor with dimensions $(w_0, h_0, f_0)$; each downsampling block then halves the width $w_0$ and the height $h_0$ while doubling the feature dimension $f_0$ 36.
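The shape bookkeeping of the encoding path can be illustrated with a few lines of Python. This is a minimal sketch that assumes every one of the four blocks halves the spatial size and doubles the feature dimension; the starting feature dimension of 48 and the function name `encoder_shapes` are hypothetical.

```python
def encoder_shapes(w0, h0, f0, num_blocks=4):
    """Return the (width, height, features) shape after each encoder stage."""
    shapes = [(w0, h0, f0)]
    w, h, f = w0, h0, f0
    for _ in range(num_blocks):
        # Each downsampling block halves width and height and doubles features.
        w, h, f = w // 2, h // 2, f * 2
        shapes.append((w, h, f))
    return shapes


# Hypothetical example: a 256 x 256 input with f0 = 48 (f0 is an assumption).
shapes = encoder_shapes(256, 256, 48)
for stage, shape in enumerate(shapes):
    print(stage, shape)
```

Each intermediate shape is also the shape of the tensor that the skip connection hands to the matching decoder stage.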
For the ViT module, as shown in Fig. 3, ViT is composed primarily of a stack of transformer encoder blocks. To construct an input to the stack, the ViT first flattens an encoded image along the spatial dimensions to form a sequence of tokens. The token sequence has length $w \times h$, and each token in the sequence is a vector of length $f$. It then concatenates each token with its two-dimensional Fourier positional embedding of dimension $f_p$ and linearly maps the result to have dimension $f_v$ 36.
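The token construction described above can be sketched in NumPy. The exact form of the Fourier positional embedding is not given in the text, so the sinusoidal projection of normalized 2D coordinates used here is an assumption, as are the helper names and the random matrix standing in for the learned linear map.

```python
import numpy as np


def fourier_positions(w, h, f_p, rng):
    """Assumed 2D Fourier embedding: sin of a linear projection of (x, y)."""
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (w*h, 2)
    freqs = rng.normal(size=(2, f_p))                    # stand-in for learned frequencies
    return np.sin(coords @ freqs)                        # (w*h, f_p)


def build_tokens(feature_map, f_p, f_v, rng):
    """Flatten an (h, w, f) feature map into w*h tokens of dimension f_v."""
    h, w, f = feature_map.shape
    tokens = feature_map.reshape(h * w, f)                 # flatten spatial dims
    pos = fourier_positions(w, h, f_p, rng)
    joined = np.concatenate([tokens, pos], axis=1)         # (w*h, f + f_p)
    weight = rng.normal(scale=0.02, size=(f + f_p, f_v))   # stand-in for the linear map
    return joined @ weight                                 # (w*h, f_v)


rng = np.random.default_rng(0)
encoded = rng.normal(size=(16, 16, 384))  # hypothetical encoder output
tokens = build_tokens(encoded, f_p=64, f_v=384, rng=rng)
print(tokens.shape)
```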
For the feedforward neural network, as shown in Fig. 4, we use a deep convolutional network instead of the original fully connected layer. The input, i.e., a sequence of tokens, is first reshaped to a feature map arranged on a 2D lattice. Then two $1 \times 1$ convolutions along with a depth-wise convolution are applied to the feature map 38. After that, the feature map is reshaped back to a sequence of tokens, which are used by the self-attention of the transformer layer. To improve the Transformer convergence, we adopt the ReZero regularization scheme and introduce a trainable scaling parameter $\alpha$ that modulates the magnitudes of the nontrivial branches of the residual blocks. The output from the transformer stack is linearly projected back to have dimension $f$ and unflattened to have width $w$ and height $h$. In this study, we use 12 transformer encoder blocks 36.
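A minimal NumPy sketch of such a convolutional feed-forward branch with a ReZero-style residual is given below. The GELU activation, the 3 x 3 depth-wise kernel size, the hidden width, and all function names are assumptions for illustration; the text only specifies two 1 x 1 convolutions, one depth-wise convolution, and the scaling parameter α.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution; x is (h, w, c), kernels is (3, 3, c)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding keeps the size
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def conv_feed_forward(tokens, h, w, w_in, dw, w_out, alpha):
    x = tokens.reshape(h, w, -1)        # Seq2Img: tokens -> 2D feature map
    x = gelu(x @ w_in)                  # first 1x1 convolution (per-pixel linear map)
    x = gelu(depthwise_conv3x3(x, dw))  # depth-wise convolution mixes neighbours
    x = x @ w_out                       # second 1x1 convolution
    branch = x.reshape(h * w, -1)       # Img2Seq: feature map -> tokens
    return tokens + alpha * branch      # ReZero: alpha scales the residual branch


rng = np.random.default_rng(0)
h, w, f, hidden = 8, 8, 16, 64
tokens = rng.normal(size=(h * w, f))
params = (rng.normal(size=(f, hidden)), rng.normal(size=(3, 3, hidden)), rng.normal(size=(hidden, f)))
out_init = conv_feed_forward(tokens, h, w, *params, alpha=0.0)
out_later = conv_feed_forward(tokens, h, w, *params, alpha=0.1)
```

With α initialised to zero the block starts as an identity map, which is the property ReZero relies on to ease convergence.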

Discriminator loss with gradient penalty (GP)
To improve training stability, we introduce a generalized GP form 36,39 with the following $D_A$ loss formula: where $\mathcal{L}_{disc,A}$ is defined as in Eq. 1, and $\mathcal{L}_{disc,B}$ follows the same form. In our experiments, this $\gamma$-centered GP regularization provides more stable training and is less sensitive to the hyperparameter choices 36.
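As a sketch of what a $\gamma$-centered gradient penalty typically looks like, one common form is shown below. The weight $\lambda_{GP}$, the normalization by $\gamma^2$, and the distribution over which the expectation is taken are assumptions here, not details stated in the text.

```latex
\mathcal{L}^{GP}_{disc,A} = \mathcal{L}_{disc,A}
  + \lambda_{GP}\, \mathbb{E}_{x}\!\left[
      \frac{\big(\lVert \nabla_x D_A(x) \rVert_2 - \gamma\big)^2}{\gamma^2}
    \right]
```

Setting $\gamma = 1$ recovers a penalty that pulls the discriminator gradient norm toward 1, while $\gamma \to 0$ approaches a zero-centered penalty.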

Pixel-wise consistency loss
To improve the consistency of the generated and source images, we experiment with the addition of an extra term $\mathcal{L}_{consist}$ 40 to the generator loss. This term captures the $L_1$ difference between the downsized versions of the source and translated images. For example, for images of domain A: where $F$ is a resizing operator down to $32 \times 32$ pixels (low-pass filter). We add this term to the generator loss with a magnitude $\lambda_{consist}$ for both domains 40.
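The idea of comparing downsized images can be sketched as follows. This is an illustration under stated assumptions: the resizing operator F is approximated by block averaging, the L1 difference is taken as a mean over pixels, and the function names are hypothetical.

```python
import numpy as np


def downsize(image, size=32):
    """Average-pool a 2D image to size x size (a simple low-pass resize)."""
    h, w = image.shape
    if h % size or w % size:
        raise ValueError("image sides must be multiples of the target size")
    return image.reshape(size, h // size, size, w // size).mean(axis=(1, 3))


def consistency_loss(source, translated, size=32):
    """Mean absolute difference between the downsized source and translation."""
    return np.abs(downsize(translated, size) - downsize(source, size)).mean()


rng = np.random.default_rng(0)
source = rng.normal(size=(256, 256))  # stand-in for a domain A image
same = consistency_loss(source, source)
shifted = consistency_loss(source, source + 0.5)
print(same, shifted)
```

Because both images are low-pass filtered first, the term penalizes coarse intensity and layout changes while leaving fine texture free to differ between domains.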

Ethical statement
We confirmed that all methods were carried out in accordance with relevant guidelines and regulations, and informed consent for patients was waived by the Research Ethics Committee of the Nanjing Medical University.
All experimental protocols and data in this study were approved by the Research Ethics Committee of the Nanjing Medical University. Approval number: NMUE2021301.

Experiments

Data acquisition
In this study, we test our proposed method on two datasets provided by a cooperative tertiary hospital.

H&N dataset
The CBCT and CT images were selected from 30 patients who received volumetric modulated arc therapy (VMAT) in the head and neck (H&N) for nasopharyngeal carcinoma (NPC) or hypopharyngeal cancer (HC) between October 1, 2021 and September 1, 2023. The CT volumes were obtained with dimensions of 512 × 512 on the axial plane, a pixel size of 0.625 × 0.625 mm², and a slice thickness of 2.75 mm using a GE Discovery positioning system. The CBCT volumes were obtained using Elekta XVI systems. The imaging protocol used the following parameters: 200 degrees gantry rotation, 100 kVp, 10 mA, 10 ms, and an F0S20 collimator. The CBCT images had a size of 384 × 384 on the axial plane. Each patient had 1 planning CT volume taken at positioning before treatment and 3 CBCT volumes per week taken between treatments. We randomly divided the data into a training set and a test set at a ratio of 8:2.

Chest dataset
The Chest dataset had the same acquisition period as the H&N dataset. The CT and CBCT acquisition parameters differed only slightly. The dataset consisted of 30 patients, each with 1 planning CT volume taken before treatment and 3-5 CBCT volumes per week taken between treatments. The CT volumes were obtained with dimensions of 512 × 512 on the axial plane, a pixel size of 0.625 × 0.625 mm², and a slice thickness of 5 mm using a GE Discovery positioning system. The CBCT images were acquired with the following parameters: 360 degrees gantry rotation, 100 kVp, 10 mA, 10 ms, and an F0M10 collimator. The CBCT volumes were reconstructed at medium resolution (1 × 1 × 1 mm³ voxels) on a 410 × 410 × 120 matrix. We randomly divided the data into a training set and a test set at a ratio of 8:2.
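A random 8:2 division of either dataset can be sketched as below. Splitting at the patient level, so that all volumes of one patient fall on the same side, is an assumption made here to avoid leakage between training and test data; the text does not state the unit of the split. The seed and function name are hypothetical.

```python
import random


def split_patients(patient_ids, train_fraction=0.8, seed=0):
    """Shuffle patient identifiers and cut them into training and test lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # local generator keeps the split reproducible
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# 30 patients per dataset, identified here by hypothetical integer IDs.
train_ids, test_ids = split_patients(range(30))
print(len(train_ids), len(test_ids))
```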

Data processing
During CBCT and CT scanning, non-human structures (such as treatment beds, fixation devices, and masks) are captured in the resulting images. These structures not only slow model training but also compromise the quality of synthesized images. To mitigate these issues, denoising is essential to eliminate the interference of irrelevant information prior to model training. In this study, the CT contours manually annotated by doctors serve as masks. These masks are multiplied with the corresponding CT images to generate clean CT images. Likewise, the masks are also multiplied with the CBCT images to produce clean CBCT images suitable for training.
CBCT differs from CT in imaging hardware, clinical protocols, and scanning configurations, so matching scans from the same patient is often challenging. Recognizing the stability of organ positions and reduced tissue mobility during data acquisition, we used the open-source Advanced Normalization Tools (ANTs) for affine registration, with the primary objective of aligning each CBCT and CT pair for model testing.
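The masking step amounts to an element-wise product between a binary contour mask and the image volume. The sketch below assumes the mask has already been rasterized onto the same voxel grid as the image; the function name and the toy arrays are hypothetical.

```python
import numpy as np


def apply_contour_mask(volume, mask):
    """Keep voxels inside the body contour and set everything else to 0."""
    if volume.shape != mask.shape:
        raise ValueError("mask and volume must share the same grid")
    return volume * mask.astype(volume.dtype)


# Toy 2 x 4 x 4 volume: the mask keeps only the central 2 x 2 region of each slice.
volume = np.arange(32, dtype=np.float32).reshape(2, 4, 4)
mask = np.zeros((2, 4, 4), dtype=np.uint8)
mask[:, 1:3, 1:3] = 1
clean = apply_contour_mask(volume, mask)
print(clean[0])
```

Note that plain multiplication maps the background to 0, so if intensities are stored in Hounsfield units the background would need a separate fill value to represent air.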

Evaluation
To accurately compare the similarity between the sCT images generated by different models and the CT images, we introduced quantitative evaluation metrics: mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). A lower MAE value indicates less difference between sCT and CT, and hence more realistic image generation. Conversely, higher values of PSNR and SSIM indicate greater similarity between sCT and CT, and hence higher synthesis quality and more realistic images. These metrics are defined as: where $n$ denotes the number of testing slices, and $x$ and $y$ denote the CT and the CBCT, respectively. The CBCT is passed through the generator $G_{CBCT \to CT}$ to produce the predicted CT, and the absolute error with respect to the CT is then calculated.
$MAX_{rCT}$ denotes the maximum pixel value in the sCT, and MSE is the mean square error. A larger PSNR indicates a higher similarity between the generated CT and the real CT, which means that the quality of the generated CT is better.
where $x$ and $y$ represent the CT and the fake CT generated from the CBCT by the generator $G_{CBCT \to CT}$, respectively; $u_x$ and $u_y$ represent the means of $x$ and $y$; $\sigma_x^2$ and $\sigma_y^2$ represent the variances of $x$ and $y$; $\sigma_{xy}$ represents the covariance of $x$ and $y$; and $c_1$ and $c_2$ are two constants used to maintain stability. The value of SSIM ranges from −1 to 1, and the larger the value, the more similar the two images are.
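The three metrics can be sketched in NumPy using their standard textbook definitions. This is an illustration rather than the evaluation code of the study: the SSIM here is computed once over the whole image instead of over sliding windows, and the constants `k1 = 0.01` and `k2 = 0.03` are the commonly used defaults, assumed rather than taken from the text.

```python
import numpy as np


def mae(ct, sct):
    """Mean absolute error between the real CT and the synthetic CT."""
    return np.mean(np.abs(ct - sct))


def psnr(ct, sct, max_value):
    """Peak signal-to-noise ratio in dB; undefined when the images are identical."""
    mse = np.mean((ct - sct) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)


def ssim_global(x, y, data_range, k1=0.01, k2=0.03):
    """Single-window SSIM built from global means, variances, and covariance."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


rng = np.random.default_rng(0)
ct = rng.uniform(0, 255, size=(64, 64))     # stand-in for a real CT slice
sct = ct + rng.normal(scale=5.0, size=ct.shape)  # stand-in for a synthetic CT slice
print(mae(ct, sct), psnr(ct, sct, 255.0), ssim_global(ct, sct, 255.0))
```

Averaging each metric over the n test slices gives the dataset-level scores of the kind reported in the tables.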

Network training
In the experiments, all images are normalized to $[-1, 1]$ and resized to $256 \times 256$. We train the generator five times and then train the discriminator once. The model is trained for 200 epochs with a batch size of 5. The other comparison methods are implemented based on the code and details provided by their authors and use the same hyperparameter settings as ours. All algorithms in this study were implemented on a Linux system equipped with four NVIDIA Tesla V100 GPUs using Python 3.6 (https://www.python.org/downloads/release/Python-362/) and Tensorflow 1.14 (https://tensorflow.google.co.uk/versions). Figure 5 shows a plot of the discriminator loss function versus the number of iterations on the two datasets, H&N and Chest, which shows that the proposed method (Ours) converges faster and trains better.
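The normalization and the five-to-one update pattern can be sketched as follows. The intensity window of −1000 to 2000 used for clipping is an assumption, since the text does not state how raw values are mapped to $[-1, 1]$; the function names are hypothetical.

```python
import numpy as np


def normalize(image, lo=-1000.0, hi=2000.0):
    """Clip to an assumed intensity window and rescale linearly to [-1, 1]."""
    clipped = np.clip(image, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0


def update_schedule(num_steps, generator_steps=5):
    """Label each step 'G' or 'D': five generator updates, then one discriminator update."""
    period = generator_steps + 1
    return ["D" if (step + 1) % period == 0 else "G" for step in range(num_steps)]


scaled = normalize(np.array([-1000.0, 500.0, 2000.0, 5000.0]))
schedule = update_schedule(12)
print(scaled, "".join(schedule))
```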

Comparison of different methods
Tables 1 and 2 present a comparison of quantitative results between our method and several comparison algorithms. Our proposed method demonstrates superior performance to CycleGAN and its variations in all three evaluation metrics mentioned above on the H&N dataset. This is attributed to the introduction of an improved ViT-based U-net framework. This framework enables the extraction and preservation of essential features and detailed information, automatic focus on information from different positions in the images, better comprehension of the global structure of the images, and emphasis on regions with more details. Consequently, the generated images are clearer and more realistic. Furthermore, we apply our method to the Chest dataset without altering any parameters and report the experimental results in Table 2. Our method exhibits considerable advantages on the Chest dataset as well, outperforming CycleGAN and its variations in the evaluated metrics. The key factor behind this success is the incorporation of GP and the additional pixel-wise consistency loss, which enhance the stability and robustness of the model. These experiments validate the applicability of our method not only to the head and neck region but also to other parts of the body.
Figure 5. Plot of loss function vs. number of iterations on both datasets.

Ablation studies for IViT-CycleGAN
To thoroughly assess the efficacy of each module, we employed CycleGAN-ViT as our backbone and conducted a module stacking analysis to gauge their individual impact on the overall performance. The abbreviations used are as follows: Depth-Wise Convolution Network (DCN), Discriminator Loss with Gradient Penalty (GP), and Pixel-wise Consistency Loss (PL). The results from the H&N dataset are compiled in Table 3, while those from the Chest dataset are presented in Table 4. The experimental findings substantiate the contributions of each module to the overall performance. DCN, due to its inherent local properties, complements the self-attention mechanism in ViT, enabling it to engage in both global and regional information exchange. This local focus facilitates the extraction of finer details, resulting in more vivid and realistic generated images. GP, through regularization controlled by the parameter β, improves model stability during training. PL, by measuring the L1 difference between source and generated images, enhances consistency, thereby improving image generation quality. A comparative analysis of metrics, including MAE, PSNR and SSIM, reveals that DCN exhibits the most significant performance boost compared to GP and PL.

Visualization
In addition to quantifying the sCT using the aforementioned evaluation indicators, we incorporate visualization techniques to explore the results from various perspectives and validate the effectiveness of our proposed method by comparing the outputs of different models. In Fig. 6, we present a comparison of synthesis results obtained from six algorithms, namely CycleGAN, DualGAN, AttentionGAN, RegGAN, ADCycleGAN, and Ours, in H&N patients. These results showcase the generated images of the cervical bone (marked by green arrows), nasopharynx (marked by blue arrows), pituitary (marked by yellow arrows), and eyes (marked by orange arrows) in a sequential left-to-right and top-to-bottom manner. Upon closer examination of the magnified images, it is evident that the alternative algorithms yield images with increased noise levels and significant loss of lesion details. In contrast, our proposed method generates images characterized by minimal noise, enhanced detail accuracy, and closer resemblance to Real CT results. This outcome can primarily be attributed to the incorporation of the ViT-based U-net framework, which excels in feature extraction while preserving crucial detailed information. The framework also demonstrates improved comprehension of the image's global structure, resulting in the production of images that are significantly clearer and more realistic.
We also conducted a comparative analysis of the synthesis results obtained from the six algorithms on the Chest dataset, as shown in Fig. 7. The first and second rows illustrate the bronchi (highlighted by green arrows) and the sternum (highlighted by blue arrows) as the representative anatomical structures, respectively, while the third and fourth rows depict the lung tumor regions. A thorough examination of the partially enlarged images reveals that our method applied to the Chest dataset produces outcomes consistent with those observed in the H&N dataset. Specifically, the generated images exhibit diminished noise, enhanced detail accuracy, and a greater resemblance to Real CT scans. Notably, our proposed algorithm demonstrates a more striking similarity to Real CT scans in the tumor regions, which proves instrumental in discerning changes within the tumor areas and offering valuable image references for adaptive radiotherapy. The Hounsfield Units (HU) CT values, reflecting tissue density and X-ray absorption, were analyzed in the test set slices over the range from −500 to 500 HU.
Figure 8 exhibits a comparative histogram of HU values for our method and CBCT, revealing that the curve shapes and peak positions of our approach more closely resemble those of real CT scans, suggesting a certain level of accuracy in the generated sCT. Figures 9 and 10 present difference plots comparing our method, CBCT, and ground truth CT. Utilizing a rainbow color mapping, with blue indicating minimal difference and red indicating maximum, the plots demonstrate that the discrepancies between our method and CT are significantly smaller than those between CBCT and CT, indicating that the sCT generated in this study achieves a CT-like quality to a considerable extent. In addition to comparing the generated details, we also assess the performance of different methods using CT values. We use each pixel on the vertical and horizontal axes as a unit and calculate the average CT value at each pixel of all the data in the entire test set. Figures 11 and 12 showcase the distribution of average CT values across the vertical and horizontal axes. The x-axis denotes the pixel position, while the y-axis represents the average pixel value. The purple curve corresponds to the CT value distribution curve of Real CT, the red curve represents our proposed method, and the remaining colors indicate other methods. The outcomes highlight the congruence of CT value distribution trends between Real CT and the other five methods. Notably, our CT value distribution curve bears a stronger resemblance to Real CT when juxtaposed with the curves of the other five methods. Meanwhile, the CT values obtained from the other methods slightly surpass those of Real CT. In essence, our method generates sCT images that closely approximate Real CT, thus rendering them more realistic.
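The two summary curves discussed above, the HU histogram and the average CT value per pixel position, can be sketched in NumPy. The interpretation that each profile averages over all test slices and over the orthogonal image axis is an assumption, as are the bin count and the function names.

```python
import numpy as np


def hu_histogram(slices, lo=-500, hi=500, bins=100):
    """Histogram of HU values restricted to the range [lo, hi]."""
    values = slices[(slices >= lo) & (slices <= hi)]
    counts, edges = np.histogram(values, bins=bins, range=(lo, hi))
    return counts, edges


def axis_profiles(slices):
    """Average CT value per row position and per column position.

    slices has shape (n_slices, rows, cols).
    """
    vertical = slices.mean(axis=(0, 2))    # one value per row
    horizontal = slices.mean(axis=(0, 1))  # one value per column
    return vertical, horizontal


rng = np.random.default_rng(0)
test_slices = rng.normal(loc=0.0, scale=400.0, size=(10, 64, 48))  # synthetic stand-in data
counts, edges = hu_histogram(test_slices)
vertical, horizontal = axis_profiles(test_slices)
print(counts.sum(), vertical.shape, horizontal.shape)
```

Computing the same curves for Real CT, CBCT, and each sCT method and overlaying them gives plots of the kind shown in Figs. 8, 11, and 12.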

Visualization of the synthetic CT 3D reconstruction
The CT imaging process, characterized by its distinct mechanism, yields two-dimensional data in the form of X-ray-derived slices. In clinical settings, three-dimensional reconstructions are crucial for multi-faceted analysis of patient lesions, facilitating accurate diagnosis and treatment. To evaluate the fidelity of our sCT, we conducted 3D reconstructions, focusing on the consistency across dimensions. Figures 13 and 14 illustrate the results for the H&N and Chest datasets, featuring axial, sagittal, and coronal perspectives. Our method generates the axial view, while the sagittal and coronal views are reconstructed from the sCT. The reconstructed slices from these datasets show that the generated sCT successfully retains the original anatomical integrity, ensuring a consistent representation of internal organ structures. In the dose-volume histogram (DVH) on the right side of Figs. 13 and 14, the solid line represents Real CT and the dashed line represents our method; the close proximity of our method to the real clinical dose distribution supports the clinical applicability of our method.
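Reconstructing the sagittal and coronal views from a stack of generated axial slices is an indexing operation on the stacked volume. The sketch below assumes the slices are stacked in the order (slice, row, column) with rows running anterior to posterior and columns running left to right; that axis convention and the function name are assumptions.

```python
import numpy as np


def orthogonal_views(volume, z, y, x):
    """Extract axial, coronal, and sagittal planes from a (slices, rows, cols) volume."""
    axial = volume[z, :, :]     # the plane the generator produces directly
    coronal = volume[:, y, :]   # fixed row across all slices
    sagittal = volume[:, :, x]  # fixed column across all slices
    return axial, coronal, sagittal


# Hypothetical stack of 120 generated axial slices of size 256 x 256.
sct_volume = np.zeros((120, 256, 256), dtype=np.float32)
axial, coronal, sagittal = orthogonal_views(sct_volume, z=60, y=128, x=128)
print(axial.shape, coronal.shape, sagittal.shape)
```

Since each axial slice is generated independently, discontinuities between neighbouring slices would show up as streaks in the coronal and sagittal planes, which is why these views are a useful consistency check.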

Dose calculation
The primary purpose of sCT is to serve as a foundation for subsequent clinical tasks, particularly dose calculation. Hence, dose calculation offers the most direct approach to verifying the effectiveness of sCT generation and its clinical suitability. To this end, we conducted a comparison between the sCT generated by our method and the Real CT across different dose levels. The resulting discrepancies were then visualized in three-dimensional displays, as presented in Figs. 15 and 16. On the left side of Fig. 15, the differences between our method and the Real CT treatment plan for a nasopharyngeal cancer patient under different dose distributions are shown. On the right side of Fig. 15, the differences in DVH for the patient are displayed, with the solid line representing Real CT and the dashed line representing our method. It can be observed from the left side of Fig. 15 that our method closely approximates the actual clinical dose distribution under different dose distributions. Examination of the DVH on the right side of Fig. 15 reveals nearly no disparity in the preventive dose for the nasopharyngeal target area and lymphatic drainage area. The experiment supports the clinical applicability of our method. Furthermore, for Chest patients, we observed the discrepancies between our method and Real CT in sCT under different dose distributions. Based on the 3D dose distribution and DVH in Fig. 16, it can be concluded that there is only a slight disparity in the dose received by the target area and lung tissue, thereby further confirming the clinical applicability and robustness of our method.
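A cumulative DVH of the kind compared in Figs. 15 and 16 can be computed from a dose grid and a structure mask as sketched below. The dose arrays here are synthetic stand-ins; how the dose was actually calculated on the CT and the sCT is not described in the text, and the bin width and function name are assumptions.

```python
import numpy as np


def cumulative_dvh(dose, structure_mask, bin_width=0.5):
    """Percentage of the structure volume receiving at least each dose level."""
    d = dose[structure_mask.astype(bool)]
    levels = np.arange(0.0, d.max() + bin_width, bin_width)
    volume_pct = np.array([(d >= level).mean() * 100.0 for level in levels])
    return levels, volume_pct


rng = np.random.default_rng(0)
mask = np.zeros((20, 32, 32), dtype=bool)
mask[5:15, 8:24, 8:24] = True                       # hypothetical target structure
dose_ct = rng.uniform(55.0, 70.0, size=mask.shape)  # stand-in dose on Real CT (Gy)
dose_sct = dose_ct + rng.normal(scale=0.3, size=mask.shape)  # stand-in dose on sCT

levels_ct, dvh_ct = cumulative_dvh(dose_ct, mask)
levels_sct, dvh_sct = cumulative_dvh(dose_sct, mask)
```

Plotting `dvh_ct` as a solid line and `dvh_sct` as a dashed line against the dose levels reproduces the layout of the DVH panels, and the gap between the two curves indicates the dosimetric impact of using sCT in place of CT.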

Discussion
This research introduces IViT-CycleGAN, an unsupervised learning model designed to synthesize sCT from CBCT data.The selection of CycleGAN is driven by the practical challenge of obtaining paired CBCT and CT scans in clinical settings.Our approach enhances the original CycleGAN by incorporating a ViT-based U-Net generator, which effectively extracts and retains vital features and fine details.To further refine image generation, we integrate a deep convolutional network within the feedforward neural network, leveraging the Transformers' self-attention mechanism to enable automatic focus on diverse image regions, thereby improving global understanding and enhancing detail localization.A gradient penalty is introduced to ensure more stable training, and an additional loss term is added to the generator's objective function to capture discrepancies between the source and generated images.
Our model exhibits superior quantitative performance compared to prevailing unsupervised learning techniques, achieving the best evaluation metrics among the compared methods across both datasets. Comprehensive ablation studies, detailed in Tables 3 and 4, consistently reveal the positive impact of our proposed modules on the model's overall efficacy. Of particular note, the DCN module stands out with a more substantial boost, attributed to its inherent local characteristics, which complement the self-attention mechanism in ViT. This integration enables ViT to engage in both global and local information exchange, thereby enhancing its capabilities.
In visual assessments, we rigorously tested our model's superiority through extensive experiments. As depicted in Figs. 6 and 7, our model-generated images exhibit reduced noise and enhanced detail, closely resembling authentic CT scans.
For the H&N dataset's first row, our sCT exhibits the closest resemblance to real CT at the conus region (green arrow), with ADCycleGAN and RegGAN also displaying comparable performance. However, in the nasal cavity, ADCycleGAN and RegGAN differ significantly from the real CT in shape. CycleGAN, DualGAN, and AttentionGAN exhibit larger discrepancies, characterized by blurred details and excessive noise in the conus area. In the nasopharynx (blue arrow) of the second row, ADCycleGAN, RegGAN, CycleGAN, and DualGAN present similar shapes with minor differences from the real CT, such as blurred boundaries and missing details. Our sCT stands out with clear details, while AttentionGAN performs the least favorably. In the pituitary region (yellow arrow) of the third row, our sCT most closely matches the real CT, with distinct boundaries and minimal shape variations. ADCycleGAN, RegGAN, and CycleGAN lose some details, and eye distortion is prominent. DualGAN and AttentionGAN generally underperform. In the eye region (orange arrow) of the fourth row, ADCycleGAN, RegGAN, and CycleGAN exhibit minimal differences from the real CT, but brain tissue distortion is severe. Our sCT excels, whereas DualGAN and AttentionGAN lag behind.
For the Chest dataset, in the bronchial bifurcation (green arrow), ADCycleGAN, RegGAN, and CycleGAN exhibit smaller differences from the real CT, but overall image detail is lacking. Our sCT stands out, while DualGAN and AttentionGAN falter. In the conus region (blue arrow) of the second row, ADCycleGAN and RegGAN show less shape deviation compared to CycleGAN, DualGAN, and AttentionGAN, which exhibit the greatest discrepancies. Our sCT is the closest to the real CT in this region. In the lung tumor area (yellow dashed circle), ADCycleGAN and RegGAN have similar shapes to the real CT, but they lack information in the heart and conus regions. Our sCT outperforms the others, with CycleGAN, DualGAN, and AttentionGAN performing the worst. In the lung tumor area of the fourth row (orange dashed circle), ADCycleGAN and RegGAN have slightly less shape deviation from the real CT, but they severely lack heart information. Our sCT demonstrates superior clarity with minimal distortion, while CycleGAN, DualGAN, and AttentionGAN remain inferior.
The HU value histograms in Fig. 8 reveal that our method's curve shape and peak are closer to the real CT, indicating a certain level of fidelity. Figures 9 and 10 illustrate the difference plots between CBCT, our method, and CT for both datasets. The difference plots reveal that our method exhibits significantly smaller discrepancies compared to CBCT, indicating that the sCT generated by our approach approaches the CT standard to a certain extent. In summary, our model, trained on unpaired data, is capable of extracting and preserving crucial features and fine-grained details, automatically focusing on various image regions, and enhancing the understanding of the global image structure.
Furthermore, in addition to visual presentations, this paper also evaluates the clinical significance through expert analysis. Figures 13 and 14 illustrate the 3D reconstructions, demonstrating that sCT preserves the original anatomical structures and maintains a certain continuity in internal organ tissues. Figures 15 and 16 showcase the distribution of our sCT under different doses compared to the real clinical dose. The results indicate that the dose difference between our sCT and the real CT is minimal, supporting the clinical applicability of our method.
While our approach outperforms other unsupervised learning models, the results on the Chest dataset remain relatively average.For future work, we plan to experiment with the state-of-the-art diffusion models in the image generation domain and further investigate 3D image generation capabilities.

Conclusions
This study proposes an unsupervised learning model, IViT-CycleGAN, aiming to synthesize sCT from CBCT for future clinical practice. IViT-CycleGAN uses a U-net framework built upon the ViT architecture within the generator. This framework leverages the U-net structure to effectively extract and retain crucial features and intricate details. Moreover, we enhance the ViT model by integrating a deep convolutional network into the feed-forward neural network alongside the self-attention mechanism of the Transformer. The objective is to automatically prioritize information from various image locations during the image generation process, leading to a better comprehension of the overall image structure and emphasizing regions with finer details. Consequently, the generated images exhibit enhanced clarity and realism. To enhance the stability of model training, a gradient penalty is introduced to limit how strongly the discriminator output varies under minor changes in its input. In addition, an extra loss term is included in the generator loss to reinforce the consistency between the generated and source images by capturing their differences. The results demonstrate that IViT-CycleGAN outperforms other unsupervised learning models in terms of generating sCT, thus supporting the clinical applicability and robustness of our model. In future clinical practice, this method can assist clinicians in developing radiotherapy treatment plans.

Figure 1. Schematic of the original CycleGAN model.

Figure 4. Depth-Wise Convolution Network. DW denotes depth-wise convolution. To cope with the convolution operation, the conversion between sequence and image feature map is added by Seq2Img and Img2Seq 38.

Figure 6. Visualization of sCT generation details on the H&N dataset test set.

Figure 7. Visualization of sCT generation details on the Chest dataset test set.

Figure 8. Histograms of HU values for the two types of datasets.

Figure 11. Visualization of CT value distribution on the H&N dataset test set.

Figure 12. Visualization of CT value distribution on the Chest dataset test set.

Figure 13. 3D reconstruction visualization of the H&N dataset. From top to bottom, representing axial, sagittal and coronal planes (left). Patient DVH differences, where the solid line is Real CT and the dashed line is sCT (right).

Figure 14. 3D reconstruction visualization of the Chest dataset. From top to bottom, representing axial, sagittal and coronal planes (left). Patient DVH differences, where the solid line is Real CT and the dashed line is sCT (right).

Figure 15. Differences between our method and Real CT for different dose distributions on the H&N dataset (left). Patient DVH differences, where the solid line is Real CT and the dashed line is sCT (right).

Figure 16. Differences between our method and Real CT for different dose distributions on the Chest dataset (left). Patient DVH differences, where the solid line is Real CT and the dashed line is sCT (right).

Table 1. Comparison of metrics of different methods on the H&N dataset. Significant values are in bold.

Table 2. Comparison of metrics of different methods on the Chest dataset. Significant values are in bold.

Table 3. Quantitative results for ablations based on CycleGAN-ViT in H&N dataset. Significant values are in bold.

Table 4. Quantitative results for ablations based on CycleGAN-ViT in Chest dataset. Significant values are in bold.